Offline Oracle-Efficient Learning for Contextual MDPs via Layerwise Exploration-Exploitation Tradeoff
Motivated by the recent discovery of a statistical and computational reduction from contextual bandits to offline regression \citep{simchi2020bypassing}, we address the general (stochastic) Contextual Markov Decision Process (CMDP) problem with horizon $H$ (also known as a CMDP with $H$ layers). In this paper, we introduce a reduction from CMDPs to offline density estimation under the realizability assumption, i.e., a model class $\mathcal{M}$ containing the true underlying CMDP is provided in advance. We develop an efficient, statistically near-optimal algorithm that requires only $O(H \log T)$ calls to an offline density estimation algorithm (or oracle) across all $T$ rounds. This number can be further reduced to $O(H \log \log T)$ if $T$ is known in advance. Our results mark the first efficient and near-optimal reduction from CMDPs to offline density estimation without imposing any structural assumptions on the model class.